Craftcans.com: scraping with BeautifulSoup

This notebook explains how to get the same data using the BeautifulSoup package instead of pandas.


In [1]:
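import requests
import pandas
# BeautifulSoup 4 is distributed as the bs4 package; the old BeautifulSoup 3 module is no longer maintained
from bs4 import BeautifulSoup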

In [2]:
url = "http://craftcans.com/db.php?search=all&sort=beerid&ord=desc&view=text"

Option 3: BeautifulSoup


In [3]:
response = requests.get(url)
page = response.text
soup = BeautifulSoup(page)

If you open the website and use the Inspect Element feature of Google Chrome, you can see that this table has no class or ID, but it does have a style attribute with the value width:100%;margin-top:10px;. We can use that value to identify the correct table on the page.


In [4]:
table = soup.find("table",attrs={"style":"width:100%;margin-top:10px;"})
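To illustrate how matching on the style attribute behaves, here is a minimal sketch on a small hand-written HTML snippet. The snippet is hypothetical and only mimics the situation described above (two tables without class or ID); it assumes the bs4 package and its built-in "html.parser" backend. The point is that the attribute value is compared as an exact string, so any difference in spacing makes the lookup fail.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML: two tables, neither has a class or an ID
html = """
<table style="width:50%;"><tr><td>navigation</td></tr></table>
<table style="width:100%;margin-top:10px;"><tr><td>beer data</td></tr></table>
"""

soup = BeautifulSoup(html, "html.parser")

# The value passed in attrs must equal the attribute string in the HTML exactly
table = soup.find("table", attrs={"style": "width:100%;margin-top:10px;"})
found_text = table.get_text(strip=True)

# The same style written with extra spaces does not match, so find() returns None
missing = soup.find("table", attrs={"style": "width: 100%; margin-top: 10px;"})

print(found_text)
print(missing)
```

Because find() returns None when nothing matches, it is worth checking the result before calling methods on it.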

Now that we have found the table, we need to go row by row, read all the columns of each row and save the text inside. Let's save each row as a dictionary and collect all the dictionaries in a list (a structure that maps directly to JSON). Please note that the BEER column is a bit different: the value inside the table cell is in bold (i.e. wrapped in a <b> tag). Thus we should first find the <b> tag, and only then read the text content.


In [5]:
# find all the rows of the table and save them into the rows variable
rows = table.findAll("tr")
# create an empty list to be filled in with dictionaries
data_list = []
# for each row in the list of rows:
for row in rows:
    columns = row.findAll("td") # find all columns in that row
    # build a dictionary that maps each column name to the text content of its cell
    beer = {
        "id":columns[0].text,
        "beer":columns[1].find('b').text,
        "brewery":columns[2].text,
        "location":columns[3].text,
        "style":columns[4].text,
        "size":columns[5].text,
        "abv":columns[6].text,
        "ibu":columns[7].text
    }
    # append the dictionary to the list
    data_list.append(beer)
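The loop above assumes every row has all eight cells and a <b> tag in the second one, and it also stores the header row as if it were data. The following is a hedged sketch of a more defensive variant, run on a small hypothetical table with only three columns. The HTML, the "advertisement" row and the choice to strip the trailing period from the id are illustrative assumptions, not something taken from the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical three-column table that mimics the structure described above
html = """
<table>
  <tr><td>ENTRY</td><td><b>BEER</b></td><td>ABV</td></tr>
  <tr><td>2692.</td><td><b>Get Together</b></td><td>4.5%</td></tr>
  <tr><td colspan="3">advertisement</td></tr>
  <tr><td>2691.</td><td><b>Maggie's Leap</b></td><td>4.9%</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
rows = soup.find("table").find_all("tr")

beers = []
for row in rows[1:]:  # skip the first row, which holds the column titles
    columns = row.find_all("td")
    if len(columns) < 3:
        # skip rows that do not have the expected number of cells
        continue
    bold = columns[1].find("b")
    # fall back to the cell itself if the <b> tag is missing
    name_cell = bold if bold is not None else columns[1]
    beers.append({
        "id": columns[0].get_text(strip=True).rstrip("."),
        "beer": name_cell.get_text(strip=True),
        "abv": columns[2].get_text(strip=True),
    })

print(beers)
```

Whether to skip malformed rows silently or raise an error is a design choice; skipping keeps the scrape running, at the cost of hiding layout changes on the page.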

Let's see the result. The first 5 dictionaries should be enough. Note that the first one comes from the table's header row.


In [7]:
data_list[:5]


Out[7]:
[{'abv': u'ABV',
  'beer': u'BEER',
  'brewery': u'BREWERY',
  'ibu': u'IBUs',
  'id': u'ENTRY',
  'location': u'LOCATION',
  'size': u'SIZE',
  'style': u'STYLE'},
 {'abv': u'4.5%',
  'beer': u'Get Together',
  'brewery': u'NorthGate Brewing',
  'ibu': u'50',
  'id': u'2692.',
  'location': u'Minneapolis,MN',
  'size': u'16 oz.',
  'style': u'American IPA'},
 {'abv': u'4.9%',
  'beer': u"Maggie's Leap",
  'brewery': u'NorthGate Brewing',
  'ibu': u'26',
  'id': u'2691.',
  'location': u'Minneapolis,MN',
  'size': u'16 oz.',
  'style': u'Milk / Sweet Stout'},
 {'abv': u'4.8%',
  'beer': u"Wall's End",
  'brewery': u'NorthGate Brewing',
  'ibu': u'19',
  'id': u'2690.',
  'location': u'Minneapolis,MN',
  'size': u'16 oz.',
  'style': u'English Brown Ale'},
 {'abv': u'6.0%',
  'beer': u'Pumpion',
  'brewery': u'NorthGate Brewing',
  'ibu': u'38',
  'id': u'2689.',
  'location': u'Minneapolis,MN',
  'size': u'16 oz.',
  'style': u'Pumpkin Ale'}]

If you are more comfortable working with DataFrames, then the conversion can easily be done.


In [8]:
data = pandas.DataFrame(data_list)

In [9]:
data.head()


Out[9]:
   abv   beer           brewery            ibu   id     location        size    style
0  ABV   BEER           BREWERY            IBUs  ENTRY  LOCATION        SIZE    STYLE
1  4.5%  Get Together   NorthGate Brewing  50    2692.  Minneapolis,MN  16 oz.  American IPA
2  4.9%  Maggie's Leap  NorthGate Brewing  26    2691.  Minneapolis,MN  16 oz.  Milk / Sweet Stout
3  4.8%  Wall's End     NorthGate Brewing  19    2690.  Minneapolis,MN  16 oz.  English Brown Ale
4  6.0%  Pumpion        NorthGate Brewing  38    2689.  Minneapolis,MN  16 oz.  Pumpkin Ale
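As the output shows, row 0 of the DataFrame is the table's header row, and every value is still a string. Below is a minimal sketch of one possible cleanup, using a shortened hypothetical copy of data_list with only four keys. The name clean and the decision to turn unparsable IBU values into NaN are assumptions made for illustration.

```python
import pandas

# Shortened, hypothetical copy of the scraped list (header row first)
data_list = [
    {"id": "ENTRY", "beer": "BEER", "abv": "ABV", "ibu": "IBUs"},
    {"id": "2692.", "beer": "Get Together", "abv": "4.5%", "ibu": "50"},
    {"id": "2691.", "beer": "Maggie's Leap", "abv": "4.9%", "ibu": "26"},
]

data = pandas.DataFrame(data_list)

# Drop the first row (the header) and renumber the index from 0
clean = data.iloc[1:].reset_index(drop=True)

# Strip the trailing "." and "%" characters, then convert to numbers
clean["id"] = clean["id"].str.rstrip(".").astype(int)
clean["abv"] = clean["abv"].str.rstrip("%").astype(float)
# errors="coerce" turns values that cannot be parsed into NaN instead of raising
clean["ibu"] = pandas.to_numeric(clean["ibu"], errors="coerce")

print(clean)
```

With numeric columns in place, operations such as sorting by abv or computing an average become possible.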

This time, let's save the resulting data to a JSON file.


In [10]:
import json
with open("craftcans.json", "w") as f:
    # sort_keys and indent make the file stable and human-readable
    json.dump(data_list, f, sort_keys=True, indent=4)
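A quick way to confirm that the file was written correctly is to load it back and compare it with the original list. The sketch below does this with a short hypothetical data_list and writes to a temporary directory (an assumption made here so the example does not overwrite craftcans.json).

```python
import json
import os
import tempfile

# Short, hypothetical stand-in for the scraped list
data_list = [
    {"id": "2692.", "beer": "Get Together", "abv": "4.5%"},
    {"id": "2691.", "beer": "Maggie's Leap", "abv": "4.9%"},
]

# Write to a temporary directory instead of the working directory
path = os.path.join(tempfile.mkdtemp(), "craftcans.json")
with open(path, "w") as f:
    json.dump(data_list, f, sort_keys=True, indent=4)

# Read the file back and check that nothing was lost
with open(path) as f:
    loaded = json.load(f)

print(loaded == data_list)
```

Since sort_keys only changes the order of keys in the file, the loaded dictionaries still compare equal to the originals.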